Chapter 16 · Fault Tolerance

Error Handling & Fault-Tolerant Systems

Errors are not exceptional; they are inevitable. Database queries fail, external APIs time out, users send bad data, and business logic hits edge cases nobody predicted. This chapter builds the mindset and concrete toolkit for detecting, handling, and recovering from every class of backend error before it silently costs you money or trust.

01

The Fault-Tolerant Mindset

"The question is not whether errors will happen; it is how you will handle them when they do."

Every backend engineer must internalise a simple truth: your system will fail. Not might. Will. The sources are everywhere:

A fault-tolerant system is not one that never breaks. It is one that breaks predictably, recovers gracefully, and tells you exactly what happened. Achieving that requires a deliberate mindset shift from "I'll handle errors later" to "I'll design for failure from day one."

The best error handling starts before the error happens. Proactive detection, health checks, and validation prevent most runtime surprises.

02

The Five Classes of Backend Errors

Backend errors can be grouped into five broad categories. Each has a different origin, detection strategy, and fix.

① Logic Errors

App runs but produces wrong results. Hardest to detect. Can silently drain money for weeks.

② Database Errors

Connection failures, deadlocks, constraint violations, malformed SQL. Can bring the whole app down.

③ External Service Errors

Third-party APIs (payment, email, auth) time out, rate-limit, or go offline. You have no control.

④ Input Validation Errors

Users send bad, missing, or out-of-range data. Easiest to handle, provided your validation layer is robust.

⑤ Configuration Errors

Missing env vars, wrong credentials on deploy. These surface at startup, not at runtime, if you do it right.

03

Logic Errors: The Silent Killers

Logic errors are the most dangerous class because your application keeps running; it just does the wrong thing. No crash, no stack trace, no 500 response. Just quietly wrong results accumulating over time.

Classic Example

An e-commerce platform applies a discount twice due to a bug in the promotion engine. The result: negative shipping costs. The app runs perfectly. Every order ships. The company loses money on every transaction. This goes unnoticed for weeks because no monitoring alert fires on "negative shipping cost."

Common Root Causes

Logic errors involving money, permissions, or security can corrupt data and produce wrong business results for months without a single error log entry. They are only found through careful testing, monitoring business metrics, and code review.

Prevention Strategies

04

Database Errors

Most backend applications are meaningless without their database. A database error of any kind means your app cannot serve real data, which usually means a broken UI or cascading failures across services.

① Connection Errors

Your backend cannot reach the database server. Possible causes:

Connection pooling keeps a fixed number of open TCP connections to the database, avoiding the overhead of a full TCP handshake + TLS negotiation on every query. Tools: pgxpool (Go), psycopg2 pool or SQLAlchemy (Python), pg Pool (Node). Size your pool carefully: too small → bottleneck; too large → DB overload.

② Constraint Violation Errors

You are trying to perform an operation that violates a database-level rule:

| Constraint Type | Trigger | Appropriate Response |
|---|---|---|
| Unique | Insert a duplicate email / username | HTTP 409 Conflict or 400: "Email already in use" |
| Foreign Key | Reference a row that doesn't exist | HTTP 404 (or 400): "Author ID not found" |
| Not Null | Missing required column value | HTTP 400: "Field X is required" |
| Check | Value fails a custom rule (e.g. price > 0) | HTTP 400: domain-specific message |
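The table above can be sketched as a small lookup from PostgreSQL SQLSTATE codes to HTTP responses. This is a minimal illustration, not a definitive mapping: the function name `statusForSQLState` and the message strings are assumptions made for this example, and the foreign-key case picks 404 where a given API might prefer 400.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusForSQLState maps PostgreSQL integrity-constraint SQLSTATE codes
// (class 23) to an HTTP status and a safe, client-facing message.
// The name and the messages are illustrative assumptions.
func statusForSQLState(code string) (int, string) {
	switch code {
	case "23505": // unique_violation
		return http.StatusConflict, "resource already exists"
	case "23503": // foreign_key_violation
		return http.StatusNotFound, "referenced resource not found"
	case "23502": // not_null_violation
		return http.StatusBadRequest, "a required field is missing"
	case "23514": // check_violation
		return http.StatusBadRequest, "a field value is not allowed"
	default: // anything else is treated as an internal failure
		return http.StatusInternalServerError, "something went wrong"
	}
}

func main() {
	for _, code := range []string{"23505", "23503", "23502", "23514", "40P01"} {
		status, msg := statusForSQLState(code)
		fmt.Printf("%s -> %d %s\n", code, status, msg)
	}
}
```

Keeping the mapping in one function means the client-facing wording never depends on the raw database message.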

③ Query / Syntax Errors

Malformed SQL: a table name typo, referencing a column that was renamed, or a missing join condition. These are usually caught in development but can slip through if raw SQL strings are built dynamically.

Never construct SQL by string concatenation with user input. Use parameterised queries / prepared statements always. This prevents both query errors and SQL injection attacks.
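When a query genuinely has to be assembled at runtime (optional filters, for example), the placeholders can be generated while the values travel separately. The sketch below assumes a hypothetical `users` table and a hypothetical `userFilter` type; only the SQL skeleton is built from strings, and user input only ever lands in the argument slice.

```go
package main

import (
	"fmt"
	"strings"
)

// userFilter is a hypothetical set of optional search criteria.
type userFilter struct {
	Email  string
	MinAge int
}

// buildUserQuery returns a parameterised query plus its arguments.
// User-supplied values are never spliced into the SQL text; each one
// gets a numbered placeholder ($1, $2, ...) in PostgreSQL style.
func buildUserQuery(f userFilter) (string, []any) {
	var conds []string
	var args []any
	if f.Email != "" {
		args = append(args, f.Email)
		conds = append(conds, fmt.Sprintf("email = $%d", len(args)))
	}
	if f.MinAge > 0 {
		args = append(args, f.MinAge)
		conds = append(conds, fmt.Sprintf("age >= $%d", len(args)))
	}
	query := "SELECT id, email FROM users"
	if len(conds) > 0 {
		query += " WHERE " + strings.Join(conds, " AND ")
	}
	return query, args
}

func main() {
	// A classic injection attempt stays inert because it is only an argument.
	q, args := buildUserQuery(userFilter{Email: "x' OR '1'='1", MinAge: 18})
	fmt.Println(q)
	fmt.Println(args...)
}
```

The returned pair would then be handed to the driver as `db.QueryContext(ctx, q, args...)`, so the database receives the SQL text and the values separately.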

④ Deadlocks

A deadlock occurs when two (or more) transactions each hold a lock that the other needs:

[Figure: Transaction A holds a lock on Row 1 and wants Row 2; Transaction B holds a lock on Row 2 and wants Row 1. The result is a circular wait, and the DB kills one transaction.]
Fig 1. Deadlock: two transactions waiting on each other's locks indefinitely. The DB detects and aborts one.

Postgres detects deadlocks automatically and kills one transaction with error code 40P01. Your application must retry that transaction. Prevention: always acquire locks in a consistent order across all code paths.

05

External Service Errors

Modern SaaS backends depend on a constellation of third-party services: payment processors (Stripe), email (Resend, SendGrid), object storage (S3), auth (Clerk, Auth0), AI (OpenAI). Every one of these is a point of failure outside your control.

[Figure: your backend talks over HTTP / TCP / WS to Stripe, Resend, Clerk/Auth0, S3 Storage, OpenAI, and Analytics. Any of these going down is your problem.]
Fig 2. Each external dependency is an independent failure point your app must handle gracefully.

① Network Failures

The internet between your server and the external API is unreliable. You will encounter: connection timeouts, DNS resolution failures, network partitions, and TLS handshake errors. Set explicit timeouts on every outgoing HTTP call; never let a slow third-party API block your goroutine / thread indefinitely.

② Rate Limiting: HTTP 429

Every serious external API enforces rate limits to prevent abuse. If your app hammers an API (due to a bug, a traffic spike, or a loop error), you will receive HTTP 429 Too Many Requests.

The standard mitigation is Exponential Backoff with Jitter:

[Figure: Try #1 → 429, wait 1 s; Try #2 → 429, wait 2 s; Try #3 → 429, wait 4 s; Try #4 → 429, wait 8 s; Try #5 → 200 ✓. Wait = base × 2ⁿ + random_jitter: the wait doubles on each retry, and jitter prevents a thundering herd.]
Fig 3. Exponential Backoff: each retry waits 2× longer. Jitter spreads retries to avoid a thundering herd.

③ Service Outage / Downtime

Major cloud providers (AWS, GCP) and popular SaaS services go down occasionally. Your app needs a strategy for when a critical dependency is completely unavailable:

06

Input Validation Errors

These are the easiest errors to handle because you define the rules. Your validation layer is the first line of defence: catch bad data at the entry point, before it reaches your database or business logic.

Types of Validation

| Type | What It Checks | Example |
|---|---|---|
| Format | Shape/pattern of the value | Email regex, ISO date, E.164 phone |
| Range | Numeric bounds, string length, array size | Price: 0–99999, name: 2–100 chars |
| Required | Mandatory field present | user_id must not be null |
| Business Rule | Domain-specific constraint | Booking end_date > start_date |
| Referential | Related entity actually exists | category_id exists in categories table |

Always validate at both layers: frontend (UX) and backend (security). Never trust client-side validation alone. The backend is the authoritative gate.

Return HTTP 400 Bad Request with a structured error body listing every field that failed and why. Don't return a single generic message; help the user fix all their mistakes in one round-trip.
07

Configuration Errors

Configuration errors happen at the boundary between environments: dev → staging → production. A missing OPENAI_API_KEY, a wrong database URL, or a forgotten secret can silently break specific features while the rest of the app appears healthy.

Fail Fast at Startup, Not at Runtime

The golden rule: validate all required environment variables before the server starts accepting traffic. If any are missing or corrupt, crash immediately with a clear error message.

❌ Bad: Runtime Failure

  • App starts successfully
  • First user hits the AI image endpoint
  • OpenAI call fails because the key is missing
  • User gets a mysterious 500 error
  • Old deployment is already stopped
  • Site is down until manually fixed

✅ Good: Startup Failure

  • New deployment starts
  • Config validation runs immediately
  • Missing key detected → process exits with clear message
  • Blue-green: old deployment still running
  • Zero downtime while the ops team fixes and redeploys

Blue-green deployments make startup-time crashes safe: the new version must pass health checks before the old version is terminated. If the new version crashes at start, the old version keeps serving traffic.

Go: Config Validation at Boot

package config

import (
    "fmt"
    "os"
    "strings"
)

type Config struct {
    DatabaseURL    string
    OpenAIKey      string
    JWTSecret      string
    ResendAPIKey   string
}

// MustLoad panics if any required variable is missing.
// Call this once in main() before http.ListenAndServe.
func MustLoad() Config {
    required := []string{
        "DATABASE_URL",
        "OPENAI_API_KEY",
        "JWT_SECRET",
        "RESEND_API_KEY",
    }

    var missing []string
    for _, key := range required {
        if os.Getenv(key) == "" {
            missing = append(missing, key)
        }
    }
    if len(missing) > 0 {
        // Crash immediately: loud and clear
        panic(fmt.Sprintf("[FATAL] missing required env vars: %s",
            strings.Join(missing, ", ")))
    }

    return Config{
        DatabaseURL:  os.Getenv("DATABASE_URL"),
        OpenAIKey:    os.Getenv("OPENAI_API_KEY"),
        JWTSecret:    os.Getenv("JWT_SECRET"),
        ResendAPIKey: os.Getenv("RESEND_API_KEY"),
    }
}

// main.go (a separate file in package main; it must import "log" and the config package)
func main() {
    cfg := config.MustLoad()  // panics here if config is invalid
    server := newServer(cfg)
    log.Fatal(server.ListenAndServe())
}

08

Proactive Error Detection: Health Checks

"The best error handling starts before the error happens."

Health checks continuously verify that your system is working, not just that it is running. There is a critical difference:

[Figure: Basic liveness check (❌): GET /health → 200 OK. It checks only whether the process is running and does NOT verify DB / cache / external services. Deep readiness check (✅): GET /ready → 200 OK. It runs a test query, pings the cache, and verifies all subsystems are functional.]
Fig 4. Liveness vs Readiness: the industry-standard split (Kubernetes uses both).

What to Check

Go: Deep Health Check Endpoint

type HealthStatus struct {
    Status   string            `json:"status"`
    Checks   map[string]string `json:"checks"`
}

func healthHandler(db *pgxpool.Pool, rdb *redis.Client) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        checks := map[string]string{}
        overall := "ok"

        // DB check
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := db.Ping(ctx); err != nil {
            checks["database"] = "unhealthy: " + err.Error()
            overall = "degraded"
        } else {
            checks["database"] = "ok"
        }

        // Redis check (bounded by the same 2-second timeout context as the DB ping)
        if err := rdb.Ping(ctx).Err(); err != nil {
            checks["cache"] = "unhealthy: " + err.Error()
            overall = "degraded"
        } else {
            checks["cache"] = "ok"
        }

        status := http.StatusOK
        if overall != "ok" { status = http.StatusServiceUnavailable }

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(status)
        json.NewEncoder(w).Encode(HealthStatus{Status: overall, Checks: checks})
    }
}
09

Monitoring & Observability

Health checks tell you something is broken right now. Monitoring tells you something is about to break, and gives you the context to understand why something broke after the fact.

What to Track

| Category | Metrics to Monitor | Why |
|---|---|---|
| HTTP Layer | 4xx / 5xx rate, p50/p95/p99 latency | Surface user-facing issues immediately |
| Database | Query duration, connection pool usage, deadlock count | Detect slow queries before timeout |
| External Services | Call success rate, latency, 429 count | Know when a dependency is degrading |
| Business Metrics | Successful transactions/min, failed payments, sign-up rate | Catch logic errors invisible to error rates |
| Infrastructure | CPU, memory, disk I/O, network throughput | Resource exhaustion precedes crashes |

Don't only track error rates. A drop in successful transactions from 1000/min to 200/min is a critical problem, even if the error rate shows 0%. Always monitor positive business outcomes, not just failure signals.
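The 1000/min to 200/min scenario can be turned into a simple alert rule that compares the current rate with a baseline. This is only a sketch: the function names and the 50% threshold are assumptions, and a production system would normally express the same rule in its monitoring tool rather than in application code.

```go
package main

import "fmt"

// dropRatio returns how far current has fallen below baseline,
// as a fraction between 0 and 1. It returns 0 when there is no drop
// or when the baseline is not positive.
func dropRatio(baseline, current float64) float64 {
	if baseline <= 0 || current >= baseline {
		return 0
	}
	return (baseline - current) / baseline
}

// shouldAlert fires when the success rate has dropped by at least
// threshold (for example 0.5 for a 50% drop).
func shouldAlert(baseline, current, threshold float64) bool {
	return dropRatio(baseline, current) >= threshold
}

func main() {
	baseline, current := 1000.0, 200.0 // successful transactions per minute
	fmt.Printf("drop: %.0f%%, alert: %v\n",
		dropRatio(baseline, current)*100,
		shouldAlert(baseline, current, 0.5))
}
```

The rule fires without a single error being logged, which is exactly the case that error-rate dashboards miss.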

Structured Logging (JSON)

Plain-text logs are hard to query at scale. Use structured JSON logs so log aggregation tools (Grafana Loki, Datadog, ELK) can parse, filter, and alert on them programmatically.

Python: structlog

import structlog

log = structlog.get_logger()

# Good: structured, queryable, no sensitive data
log.error(
    "payment_failed",
    user_id="u_9a3f",          # ID, not email
    correlation_id="req_abc123",
    provider="stripe",
    error_code="card_declined",
    amount_cents=4999,
)

# BAD: never log PII or secrets
# log.error("payment_failed", email="alice@example.com", card="4242...")
10

Recovery Strategies

Recoverable vs Non-Recoverable

Recoverable Errors

  • Transient network glitch to email API
  • Database connection pool temporarily exhausted
  • Rate limit 429 from external service

Strategy: Retry with exponential backoff. Queue the work. Don't give up immediately.

Non-Recoverable Errors

  • Redis cluster completely down
  • Payment processor offline for hours
  • Corrupt data in the DB

Strategy: Graceful degradation. Fallback. Disable the feature. Protect core functionality.

Exponential Backoff in Go

import (
    "errors"
    "fmt"
    "log/slog"
    "math/rand"
    "time"
)

func sendEmailWithRetry(to, subject, body string) error {
    const maxRetries = 5
    baseDelay := 1 * time.Second

    var lastErr error
    for attempt := 0; attempt < maxRetries; attempt++ {
        lastErr = emailClient.Send(to, subject, body)
        if lastErr == nil {
            return nil  // success
        }

        if !isRetryable(lastErr) {
            return fmt.Errorf("permanent failure: %w", lastErr)
        }
        if attempt == maxRetries-1 {
            break  // no point sleeping after the final attempt
        }

        // Exponential backoff: 1s, 2s, 4s, 8s between the five attempts
        wait := baseDelay * time.Duration(1<<attempt)
        // Add up to 20% random jitter to prevent a thundering herd
        jitter := time.Duration(rand.Int63n(int64(wait / 5)))

        slog.Warn("email send failed, retrying",
            "attempt", attempt+1,
            "wait_ms", (wait + jitter).Milliseconds(),
            "error", lastErr)
        time.Sleep(wait + jitter)
    }
    return fmt.Errorf("all %d attempts failed: %w", maxRetries, lastErr)
}

func isRetryable(err error) bool {
    // Retry on 429, 503, network errors; not on 400, 401, 422
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode == 429 || httpErr.StatusCode >= 500
    }
    return true  // network errors are always retryable
}

Automatic vs Manual Recovery

Data integrity is your #1 priority during any incident. Never auto-delete or auto-migrate data as part of an error recovery procedure. Take a backup first, always.

11

Global Error Handler: The Final Safety Net

The global error handler is a single middleware that sits at the outermost layer of your application, catches every error that bubbles up from any layer, and converts it into a properly formatted HTTP response.

[Figure: request flow. Router → Handler (ValidationError, 400) → Service (BusinessError) → Repository (DBError: unique, no rows, FK). Errors bubble up to the global error handler middleware, which maps ValidationError → HTTP 400 + field errors, DB UniqueViolation → HTTP 409 Conflict, DB NoRows → HTTP 404 Not Found, and unknown errors → HTTP 500 + generic message. The HTTP response carries safe, user-friendly messages only, with no internal details.]
Fig 5. All errors bubble up to one middleware. One place to define every response format.

Two Major Advantages

12

Go: Global Error Handler Implementation

Go doesn't have exceptions; errors are return values. The pattern is to return errors up the call stack and handle them in middleware.

Go: errors/types.go

package apperr

import "net/http"

// AppError is the canonical error type for this application.
type AppError struct {
    Code    int      // HTTP status code
    Message string   // Safe, user-facing message
    Details any      // Optional: field-level errors for 400s
    Err     error    // Original error: for logging only, NEVER sent to client
}

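func (e *AppError) Error() string { return e.Message }

// Unwrap exposes the original error to errors.Is / errors.As (server-side use only).
func (e *AppError) Unwrap() error { return e.Err }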

// Constructors
func NotFound(resource string) *AppError {
    return &AppError{Code: http.StatusNotFound, Message: resource + " not found"}
}
func Conflict(msg string) *AppError {
    return &AppError{Code: http.StatusConflict, Message: msg}
}
func BadRequest(msg string, details any) *AppError {
    return &AppError{Code: http.StatusBadRequest, Message: msg, Details: details}
}
func Internal(err error) *AppError {
    return &AppError{
        Code:    http.StatusInternalServerError,
        Message: "something went wrong",  // NEVER expose err.Error() here
        Err:     err,
    }
}
Go: middleware/error_handler.go

package middleware

import (
    "encoding/json"
    "errors"
    "log/slog"
    "net/http"

    "github.com/jackc/pgx/v5" // needed for pgx.ErrNoRows below
    "github.com/jackc/pgx/v5/pgconn"
    apperr "yourapp/errors"
)

type ErrorResponse struct {
    Code    int    `json:"code"`
    Message string `json:"message"`
    Details any    `json:"details,omitempty"`
}

// GlobalErrorHandler wraps a handler that returns an error.
func GlobalErrorHandler(next func(http.ResponseWriter, *http.Request) error) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        err := next(w, r)
        if err == nil { return }

        var appErr *apperr.AppError

        switch {
        // Already wrapped as AppError
        case errors.As(err, &appErr):
            if appErr.Err != nil {
                slog.Error("app error", "err", appErr.Err)
            }

        // Postgres unique constraint violation → 409
        case isPgError(err, "23505"):
            appErr = apperr.Conflict("resource already exists")

        // Postgres foreign key violation → 404
        case isPgError(err, "23503"):
            appErr = apperr.NotFound("referenced resource")

        // pgx no-rows → 404
        case errors.Is(err, pgx.ErrNoRows):
            appErr = apperr.NotFound("resource")

        // Everything else → 500 (never leak internal error)
        default:
            slog.Error("unhandled error", "err", err)
            appErr = apperr.Internal(err)
        }

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(appErr.Code)
        json.NewEncoder(w).Encode(ErrorResponse{
            Code:    appErr.Code,
            Message: appErr.Message,
            Details: appErr.Details,
        })
    }
}

func isPgError(err error, code string) bool {
    var pgErr *pgconn.PgError
    return errors.As(err, &pgErr) && pgErr.Code == code
}
13

Python: Global Error Handler (FastAPI)

Python: FastAPI

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from psycopg2 import errors as pg_errors
import logging

app = FastAPI()
logger = logging.getLogger("app")

# --- Custom exception types ---
class AppError(Exception):
    def __init__(self, status: int, message: str, details=None):
        super().__init__(message)  # keeps str(exc) and tracebacks informative
        self.status  = status
        self.message = message
        self.details = details

class NotFoundError(AppError):
    def __init__(self, resource: str):
        super().__init__(404, f"{resource} not found")

class ConflictError(AppError):
    def __init__(self, msg: str):
        super().__init__(409, msg)

# --- Global exception handlers ---
@app.exception_handler(AppError)
async def app_error_handler(request: Request, exc: AppError):
    return JSONResponse(
        status_code=exc.status,
        content={"code": exc.status, "message": exc.message, "details": exc.details}
    )

@app.exception_handler(pg_errors.UniqueViolation)
async def unique_violation_handler(request: Request, exc):
    logger.warning("unique_violation", extra={"path": request.url.path})
    return JSONResponse(status_code=409,
        content={"code": 409, "message": "resource already exists"})

@app.exception_handler(pg_errors.ForeignKeyViolation)
async def fk_violation_handler(request: Request, exc):
    return JSONResponse(status_code=404,
        content={"code": 404, "message": "referenced resource not found"})

@app.exception_handler(Exception)
async def unhandled_error_handler(request: Request, exc: Exception):
    # Log the real error internally, never expose it
    logger.error("unhandled_exception", exc_info=exc,
        extra={"path": request.url.path})
    return JSONResponse(status_code=500,
        content={"code": 500, "message": "something went wrong"})
14

Security: What to Expose, What to Hide

Every error message that leaves your backend is a potential information leak. Treat your error responses with the same care as your API responses.

① Never Leak Internal Details

Database error messages from Postgres contain table names, column names, index names, and constraint names. If you forward a raw pgconn.PgError message directly to the client, an attacker learns your schema and can craft more targeted SQL injection attempts.

| What You Got | What to Send to Client |
|---|---|
| duplicate key value violates unique constraint "users_email_key" | "Email already in use" |
| relation "usres" does not exist (typo) | "Something went wrong" |
| deadlock detected on relation 42816 | "Something went wrong, please retry" |
| stack trace: panic at server.go:142 | "Internal server error" |

② Vague Auth Errors (On Purpose)

Login endpoints are the most attacked surface in any application. If you return specific messages like "no user with this email exists" vs "password is incorrect", an attacker can enumerate valid emails through a simple loop.

[Figure: the attacker's strategy when you use specific errors. Step 1, email enumeration: try many emails with a fake password and read "user not found" until one hits. Step 2, valid email found: try common passwords and read "wrong password" on each miss. Step 3: the account is compromised. The fix: always return the same generic message, { "message": "invalid email or password" }.]
Fig 6. Specific auth error messages enable email enumeration attacks. Use a single generic message for all auth failures.
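A minimal sketch of the single-message rule: both failure paths return the very same error value, so the response cannot reveal which check failed. The `credentialStore` type, the `login` function, and the injected `verify` callback are assumptions for this example; the plain string comparison in `main` is a demo stand-in for a real password-hash verification such as bcrypt or argon2.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidCredentials is the only error a caller ever sees for a failed login.
var errInvalidCredentials = errors.New("invalid email or password")

// credentialStore is a hypothetical email -> stored password hash lookup.
type credentialStore map[string]string

// login returns the same error whether the email is unknown or the
// password is wrong, so responses cannot be used to enumerate accounts.
func login(store credentialStore, verify func(stored, password string) bool, email, password string) error {
	stored, found := store[email]
	if !found || !verify(stored, password) {
		return errInvalidCredentials
	}
	return nil
}

func main() {
	store := credentialStore{"alice@example.com": "correct-horse"}
	// Demo only: real code must verify against a salted password hash.
	demoVerify := func(stored, password string) bool { return stored == password }

	fmt.Println(login(store, demoVerify, "nobody@example.com", "x"))
	fmt.Println(login(store, demoVerify, "alice@example.com", "wrong"))
	fmt.Println(login(store, demoVerify, "alice@example.com", "correct-horse"))
}
```

The specific reason can still be recorded in server-side logs as a generic code; it is only the client-facing response that has to stay uniform.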

③ Safe Logging Practices

Logs are often shipped to third-party aggregation services (Datadog, Grafana Cloud, ELK). Leaked log files have exposed user records in real breaches because engineers had carelessly logged sensitive fields.

Go: safe vs unsafe logging

// ❌ UNSAFE: never do this
slog.Error("login_failed",
    "email", user.Email,          // PII leak
    "password", req.Password,     // catastrophic
    "api_key", cfg.OpenAIKey,      // secret leak
)

// ✅ SAFE: IDs and correlation only
slog.Error("login_failed",
    "user_id", user.ID,
    "correlation_id", r.Header.Get("X-Request-ID"),
    "reason", "invalid_credentials",  // generic code, not DB message
)
Follow the OWASP Authentication Cheat Sheet for authentication endpoint security. It covers error message handling, brute-force protection, account lockout, and more.
15

References & Further Reading


Backend Field Manual · Error Handling & Fault Tolerance · Chapter 16